Cell graph neural networks enable the precise prediction of patient survival in gastric cancer

Gastric cancer is one of the deadliest cancers worldwide. An accurate prognosis is essential for effective clinical assessment and treatment. Spatial patterns in the tumor microenvironment (TME) are conceptually indicative of the staging and progression of gastric cancer patients. Using spatial patterns of the TME by integrating and transforming the multiplexed immunohistochemistry (mIHC) images as Cell-Graphs, we propose a graph neural network-based approach, termed Cell−Graph Signature or CGSignature, powered by artificial intelligence, for the digital staging of TME and precise prediction of patient survival in gastric cancer. In this study, patient survival prediction is formulated as either a binary (short-term and long-term) or ternary (short-term, medium-term, and long-term) classification task. Extensive benchmarking experiments demonstrate that the CGSignature achieves outstanding model performance, with Area Under the Receiver Operating Characteristic curve of 0.960 ± 0.01, and 0.771 ± 0.024 to 0.904 ± 0.012 for the binary- and ternary-classification, respectively. Moreover, Kaplan–Meier survival analysis indicates that the “digital grade” cancer staging produced by CGSignature provides a remarkable capability in discriminating both binary and ternary classes with statistical significance (P value < 0.0001), significantly outperforming the AJCC 8th edition Tumor Node Metastasis staging system. Using Cell-Graphs extracted from mIHC images, CGSignature improves the assessment of the link between the TME spatial patterns and patient prognosis. Our study suggests the feasibility and benefits of such an artificial intelligence-powered digital staging system in diagnostic pathology and precision oncology.


GNN model architectures
When designing the GNN model architectures, we considered two types of convolutional unit: GCNConv and GINConv. The major difference between them lies in the message-passing mechanism (i.e., how node features are passed), as illustrated in Figure 2. More specifically, GCN [1] is the graph convolutional network, which calculates the node features by aggregating the features of the node and its neighbors, as shown in Figure 2a. In contrast, GIN [2] is the graph isomorphism network, which adds an extra multilayer perceptron to generate the outputs, as shown in Figure 2b. The graph convolution needs to be combined with pooling layers. We tested the performance of models with two types of pooling layers: TopKPooling [3-5] and SAGPooling [5, 6]. Both provide an effective way to preserve the critical graph features and structures, and they differ in how the node scores are calculated. TopKPooling calculates y = softmax(Xp) with a trainable projection weight p, while SAGPooling uses a GNN to extract the ranking score for the nodes: y = softmax(GNN(X, A)).
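The contrast between the two convolution types, and the node scoring used by the pooling layers, can be illustrated with a minimal, dependency-free sketch. It is not the CGSignature implementation. The toy graph, the scalar node features, the weights, the simple mean aggregation in the GCN-style step (the original GCN uses a symmetric degree normalization), and the tiny ReLU perceptron in the GIN-style step are all assumptions chosen for readability.

```python
import math

# Hypothetical toy graph: adjacency list over 4 nodes (undirected).
NEIGHBORS = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
# One scalar feature per node keeps the arithmetic easy to follow.
X = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}


def gcn_style_update(x, neighbors, weight=1.0):
    """GCN-like step: average the node with its neighbors, then scale.

    Simplification (assumption): plain mean instead of the symmetric
    degree normalization used by the original GCN.
    """
    out = {}
    for v, nbrs in neighbors.items():
        group = [x[v]] + [x[u] for u in nbrs]
        out[v] = weight * sum(group) / len(group)
    return out


def mlp(h, w1=0.5, b1=-1.0, w2=2.0, b2=0.0):
    """Tiny two-layer perceptron with a ReLU (illustrative weights)."""
    return w2 * max(0.0, w1 * h + b1) + b2


def gin_style_update(x, neighbors, eps=0.0):
    """GIN-like step: (1 + eps) * self + sum of neighbors, then an MLP."""
    return {
        v: mlp((1.0 + eps) * x[v] + sum(x[u] for u in nbrs))
        for v, nbrs in neighbors.items()
    }


def softmax_scores(values, p=1.0):
    """Node scores y = softmax(values * p) for scalar features."""
    logits = {v: values[v] * p for v in values}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {v: math.exp(l - peak) for v, l in logits.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}


def keep_top_nodes(scores, ratio=0.5):
    """Return the ids of the top ceil(ratio * N) nodes by score."""
    k = max(1, math.ceil(ratio * len(scores)))
    return sorted(sorted(scores, key=scores.get, reverse=True)[:k])


# TopKPooling-style scoring: project the raw features directly.
topk_kept = keep_top_nodes(softmax_scores(X))
# SAGPooling-style scoring: score the output of a GNN layer instead.
sag_kept = keep_top_nodes(softmax_scores(gcn_style_update(X, NEIGHBORS)))
```

In this sketch the only difference between the two pooling variants is the input to the softmax: raw projected features for the TopKPooling-style score, and the output of a graph convolution for the SAGPooling-style score, so the latter also reflects the graph structure.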
Supplementary Figure 1. An overview of the key procedures for multiplexed staining and image processing to generate Cell-Graphs. A detailed description of each step can be seen in the corresponding panel of the figure.
Metrics for evaluating the model performance
The model performance was evaluated with the following commonly used metrics: Accuracy (ACC), F1-score, Matthews Correlation Coefficient (MCC), and the receiver-operating characteristic (ROC) curve with the corresponding area under the ROC curve (AUROC). ACC, F1-score, and MCC are defined by the following equations:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Here, TP represents true positive; TN, true negative; FP, false positive; FN, false negative. For all the measures defined above, a higher value indicates a better performance of the model. The AUROC values are used as the primary performance metric to evaluate the trained models and to compare different methods.
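As an illustration of how these metrics follow from the confusion-matrix counts, the sketch below computes ACC, F1-score, and MCC from TP, TN, FP, and FN, and estimates the AUROC for a binary task as the fraction of positive-negative pairs that are ranked correctly. The function names, the zero-denominator fallbacks, and the example values are assumptions for illustration and are not taken from the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """ACC, F1-score, and MCC from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0  # fallback is an assumption
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"ACC": acc, "F1": f1, "MCC": mcc}


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example values, for illustration only.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise formulation of the AUROC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation gives the same value more efficiently on large datasets.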

Strategies to prevent overfitting
Due to the huge size of cancer histopathology images, segmenting the images at the patch or tile level prior to deep learning-based model training is a common practice in digital pathology. To avoid overfitting, in this study we applied multiple strategies: 1) a strict early stopping strategy; 2) a dynamic learning rate; and 3) pooling layers. Moreover, the changes in the training and validation loss values provide useful clues as to whether the trained deep learning model was subject to overfitting. Figures 10-19 show the detailed training and validation loss curves for binary- and ternary-classification model training with five-fold cross-validation. The position of early stopping is indicated by the red dashed line in each figure. As can be observed in these figures, all the models were stopped when the validation loss stopped decreasing, with a patience of 20 epochs. Most models required 30 to 80 epochs of training to reach their optimum. Given the multiple strategies adopted and these results, our models are unlikely to have overfitted.
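The early stopping rule described above can be sketched as follows. This is a minimal illustration assuming that "patience of 20 epochs" means training halts once the validation loss has failed to improve for 20 consecutive epochs. The class name, the `min_delta` parameter, and the simulated loss curve are hypothetical and are not taken from the study; the dynamic learning rate is not modeled here.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without validation improvement."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: minimum change that counts
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, patience=20):
    """Replay a list of validation losses; return (stop_epoch, best_epoch)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses):
        if stopper.step(epoch, loss):
            return epoch, stopper.best_epoch
    return len(val_losses) - 1, stopper.best_epoch


# Simulated curve (made up): the loss falls for 30 epochs, then plateaus.
simulated = [1.0 - 0.02 * e for e in range(30)] + [0.5] * 40
stop_epoch, best_epoch = run_training(simulated)
```

In practice the model weights from `best_epoch` would be the ones retained, since the epochs that follow it are only used to confirm that the validation loss has stopped improving.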

Supplementary tables and figures
Supplementary Table 3. Low-pass wavelet decomposition of short-term, medium-term, and long-term samples for the Pan-CK and Cell Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high-valued coefficients and blue represents low-valued coefficients. The low-pass component of the Pan-CK and Cell Area features is shown. No significant differences among short-term, medium-term, and long-term patients can be observed in the following figures.
[Image grid. Columns: Pan-CK, Cell Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 4. Low-pass wavelet decomposition of short-term, medium-term, and long-term samples for the Cytoplasm Area and Nucleus Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The low-pass component of the Cytoplasm Area and Nucleus Area features is shown. No significant differences among short-term, medium-term, and long-term patients can be observed in the figures in this table.
[Image grid. Columns: Cytoplasm Area, Nucleus Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 5. Low-pass wavelet decomposition of short-term, medium-term, and long-term samples for the Nucleus Perimeter and Nucleus Roundness features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The low-pass component of the Nucleus Perimeter and Nucleus Roundness features is shown. No significant differences among short-term, medium-term, and long-term patients can be observed in the following figures.

[Image grid. Columns: Nucleus Perimeter, Nucleus Roundness. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 6. High-pass (channel 1) wavelet decomposition of short-term, medium-term, and long-term samples for the Pan-CK and Cell Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The figures below show the high-pass (channel 1) component of the Pan-CK and Cell Area features. Major color differences among short-term (dominated by red), medium-term (red and blue mixed), and long-term (dominated by blue) patients can be observed in the figures in this table.
[Image grid. Columns: Pan-CK, Cell Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 7. High-pass (channel 1) wavelet decomposition of short-term, medium-term, and long-term samples for the Cytoplasm Area and Nucleus Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The high-pass (channel 1) component of the Cytoplasm Area and Nucleus Area features is shown. Major color differences between medium-term patients (a mix of red and green) and short- and long-term patients (dominated by a single color, red or blue) can be observed in the following figures.
[Image grid. Columns: Cytoplasm Area, Nucleus Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 8. High-pass (channel 1) wavelet decomposition of short-term, medium-term, and long-term samples for the Nucleus Perimeter and Nucleus Roundness features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The high-pass (channel 1) component of the Nucleus Perimeter and Nucleus Roundness features is shown. For Nucleus Perimeter, major color differences among short-term (dominated by blue), medium-term (a mix of red and green), and long-term (dominated by red) patients can be observed, whereas for Nucleus Roundness only the medium-term patients (a mix of red and green) differ from the patients of the other two classes (dominated by red).

[Image grid. Columns: Nucleus Perimeter, Nucleus Roundness. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 9. High-pass (channel 2) wavelet decomposition of short-term, medium-term, and long-term samples for the Pan-CK and Cell Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The high-pass (channel 2) component of the Pan-CK and Cell Area features is shown. Major color differences among short-term (dominated by a single color, red or blue), medium-term (mixed colors from red to blue), and long-term (dominated by a single color, blue or red) patients can be observed in the figures.

[Image grid. Columns: Pan-CK, Cell Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 10. High-pass (channel 2) wavelet decomposition of short-term, medium-term, and long-term samples for the Cytoplasm Area and Nucleus Area features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The high-pass (channel 2) component of the Cytoplasm Area and Nucleus Area features is shown. Major color differences between medium-term patients (a mix of red and green) and short- and long-term patients (dominated by a single color, red or blue) can be observed in the following figures.

[Image grid. Columns: Cytoplasm Area, Nucleus Area. Rows: Short-Term, Medium-Term, Long-Term.]
Supplementary Table 11. High-pass (channel 2) wavelet decomposition of short-term, medium-term, and long-term samples for the Nucleus Perimeter and Nucleus Roundness features. The decomposition was conducted on one low-pass channel and two high-pass channels. In each figure, the decomposition coefficients were visualized with a color gradient from red to blue, where red represents high coefficients and blue represents low coefficients. The high-pass (channel 2) component of the Nucleus Perimeter and Nucleus Roundness features is shown. For Nucleus Perimeter, major color differences among short-term (dominated by blue), medium-term (a mix of red and green), and long-term (dominated by red) patients can be observed, whereas for Nucleus Roundness only the medium-term patients (a mix of red and blue) differ from the patients of the other two classes (dominated by red).

[Image grid. Columns: Nucleus Perimeter, Nucleus Roundness. Rows: Short-Term, Medium-Term, Long-Term.]